Abstract: We initiate a systematic study to help distinguish 
a special group of online users, called hidden paid posters, or 
the "Internet water army" in China, from the legitimate 
ones. On the Internet, the paid posters represent a new type 
of online job opportunity. They get paid for posting comments 
and new threads or articles on different online communities 
and websites for some hidden purposes, e.g., to influence the 
opinion of other people towards certain social events or business 
markets. Though an interesting strategy in business marketing, 
paid posters may create a significant negative effect on the online 
communities, since the information from paid posters is usually 
not trustworthy. When two competing companies hire paid 
posters to post fake news or negative comments about each other, 
normal online users may feel overwhelmed and find it difficult to 
put any trust in the information they acquire from the Internet. 
In this paper, we thoroughly investigate the behavioral pattern 
of online paid posters based on real-world trace data. We design 
and validate a new detection mechanism, using both non-semantic 
analysis and semantic analysis, to identify potential online paid 
posters. Our test results with real-world datasets show very 
promising performance. 

Index Terms: Online Paid Posters, Behavioral Patterns, Detection 

I. Introduction 

According to the China Internet Network Information Center 
(CNNIC) [6], there are currently around 457 million Internet 
users in China, which is approximately 35% of its total 
population. In addition, the number of active websites in China 
is over 1.91 million. The unprecedented development of the 
Internet in China has encouraged people and companies to 
take advantage of the unique opportunities it offers. One core 
issue is how to make use of the huge online human resource to 
make the information diffusion process more efficient. Among 
the many approaches to e-marketing [4], we focus on online 
paid posters, who are used extensively in practice. 

Working as an online paid poster is a rapidly growing job 
opportunity for many online users, mainly college students 
and unemployed people. These paid posters are referred 
to as the "Internet water army" in China because of the 
large number of people who are well organized to "flood" 
the Internet with purposeful comments and articles. This new 
type of occupation originates from Internet marketing, and it 
has become popular with the fast expansion of the Internet. 
Often hired by public relations (PR) companies, online 
paid posters earn money by posting comments and articles 
on different online communities and websites. Companies are 
always interested in effective strategies to attract public atten- 
tion towards their products. The idea of online paid posters 
is similar to word-of-mouth advertisement. If a company hires 
enough online users, it would be able to create hot and trending 
topics designed to gain popularity. Furthermore, the articles 
or comments from a group of paid posters are also likely 
to capture the attention of common users and influence their 
decision. In this way, online paid posters present a powerful 
and efficient strategy for companies. To give one example, 
before a new TV show is broadcast, the host company might 
hire paid posters to initiate many discussions on the actors or 
actresses of the show. The content could be either positive or 
negative, since the main goal is to attract attention and trigger 
curiosity. 

We would like to remark here that the use of paid posters 
extends well beyond China. According to a recent news 
report in the Guardian 0, the US military and a private 
corporation are developing software that can be used 
to post information on social media websites using fake online 
identities. The objective is to speed up the distribution of 
pro-American propaganda. We believe that this would encourage 
other companies and organizations to adopt the same strategy to 
disseminate information on the Internet, leading to a serious 
spamming problem. 

However, the consequences of using online paid posters are 
yet to be seriously investigated. While online paid posters 
can be used as an efficient business strategy in marketing, 
they can also act in some malicious ways. Since the laws 
and supervision mechanisms for Internet marketing are still 
not mature in many countries, it is possible to spread wrong, 
negative information about competitors without any penalties. 
For example, two competing companies or campaigning 
parties might hire paid posters to post fake, negative news or 
information about each other. Ordinary online users 
may be misled, and it is difficult for website administrators 
to differentiate paid posters from legitimate users. Hence, 
it is necessary to design schemes to help normal users, 
administrators, or even law enforcers quickly identify potential 
paid posters. 

Despite the broad use of paid posters and the damage they 
have already caused, it is unfortunate that there is currently no 
systematic study to solve the problem. This is largely because 
online paid posters mostly work "underground" and no public 
data is available to study their behavior. Our paper is the first 
work that tackles the challenges of detecting potential paid 
posters. We make the following contributions. 

1) By working as a paid poster and following the in- 
structions given from the hiring company, we identify 
and confirm the organizational structure of online paid 
posters similar to what has been disclosed before ifTTIl . 

2) We collect real-world data from popular websites regarding 
a famous social event, in which we believe there are 
potentially many hidden online paid posters. 

3) We statistically analyze the behavioral patterns of poten- 
tial online paid posters and identify several key features 
that are useful in their detection. 

4) We integrate semantic analysis with the behavioral pat- 
terns of potential online paid posters to further improve 
the accuracy of our detection. 

The rest of the paper is organized as follows. We present 
more background information and identify the organizational 
structure of online paid posters in Section II. Section III 
presents our data collection method. In Section IV, we 
statistically analyze non-semantic behavioral features of online 
paid posters. In Section V, we introduce a simple method 
for semantic analysis that can greatly help the detection of 
online paid posters. In Section VI, we introduce our detection 
method and evaluation results. Related work is discussed in 
Section VII. We conclude the paper in Section VIII. 



II. How Do Online Paid Posters Work? 
A. Typical Cases 

To better understand the behavior and the social impact 
of online paid posters, we investigated several social events, 
which are likely to be boosted by online paid posters. We 
introduce two typical cases to illustrate how online paid posters 
could be an effective marketing strategy, in either a positive 
or a negative manner. 

Example 1: On July 16, 2009, someone posted a thread 
with blank content and the title "Junpeng Jia, your mother 
asked you to go back home for dinner!" on the Baidu Post 
Community of World of Warcraft, a Chinese online community 
for a computer game [14]. In the following two days, this 
thread magically received up to 300,621 replies and more 
than 7 million clicks. Nobody knew why this meaningless 
thread would get so much attention. Several days later, a PR 
company in Beijing claimed that they were the people who 
designed the whole event, with an intention to maintain the 
popularity of this online computer game during its temporary 
system maintenance. They employed more than 800 online 
paid posters using nearly 20,000 different user IDs. In the 
end, they achieved their goal: even though the online game was 
temporarily unavailable, the website remained popular during 
that time and it encouraged more normal users to join. This 
case not only shows the existence of online paid posters, but 
also reveals the efficiency and effectiveness of such an online 
activity. 



Example 2: On July 17, 2009, a Chinese IT company, Qihu 
360, also known as 360 for short, released free anti-virus 
software and claimed that it would provide permanent anti-virus 
service for free. This immediately made 360 a superstar 
in the anti-virus software market in China. Nevertheless, on 
July 29 an article titled "Confessions from a retired employee 
of 360" appeared in different websites. This article revealed 
some inside information about 360 and claimed that this 
company was secretly collecting users' private data. The links 
to this post on different websites quickly attracted hundreds of 
thousands of views and replies. Though 360 claimed that this 
article was fabricated by its competitors, it was sufficient to 
raise serious concerns about the privacy of normal users. Even 
worse, in late October, similar articles became popular again 
in several online communities. 360 wondered how the articles 
could be spread so quickly to hundreds of online forums in a 
few days. It was also incredible that all these articles attracted 
a huge number of replies in such a short time period. 

In 2010, 360 and Tencent, two major IT companies in China, 
were involved in a bigger conflict. On September 27, 360 
claimed that Tencent secretly scanned users' hard disks when its 
instant messaging client, QQ, was in use. It thus released a user 
privacy protector that could be used to detect hidden operations 
of other software installed on the computer, especially QQ. In 
response, Tencent decided that users could no longer use its 
service if the computer had 360's software installed. This event 
led to great controversy among hundreds of thousands of 
Internet users. They posted their comments on all kinds 
of online communities and news websites. Although both 360 
and Tencent claimed that they did not hire online paid posters, 
we now have strong evidence suggesting the opposite. Some 
special patterns are definitely unusual, e.g., many negative 
comments or replies came from newly registered user IDs 
but these user IDs were seldom used afterwards. This clearly 
indicates the use of online paid posters. 

Since a large number of comments and articles regarding this 
conflict are still available on different popular websites, we 
focus on this event as the case study in this paper. 

B. Organizational Structure of Online Paid Posters 

1) Basic Overview: These days, some websites, such as 
shuijunwang.com [21], offer Internet users the chance of 
becoming online paid posters. To better understand how online 
paid posters work, Cheng, one of the co-authors of this paper, 
registered on such a website and worked as a paid poster. We 
summarize his experience to illustrate the basic activities of 
an online paid poster. 

Once online users register on the website with their Inter- 
net banking accounts, they are provided with a mission list 
maintained by the webmaster. These missions include posting 
articles and video clips for ads, posting comments, carrying 
out Q&A sessions, etc., over other popular websites. Normally, 
the video clips are pre-prepared and the instructions for writing 
the articles/comments are given. There are project managers 
and other staff members who are responsible for validating 
the accomplishment of each poster's mission. Paid posters 
are rewarded only after their assignments pass the validation. 
An assignment is considered a "fail" if, for example, the 
posted articles or contents are deleted by other websites' 
administrators. In addition, there are some regular rules for the 
paid posters. For example, articles should be posted at different 
forums or at different sections of the same forum; comments 
should not be copied and pasted from other users' replies; the 
mission should be finished on time (normally within 3 hours), 
and so on. 

Although the mission publisher has regulations for paid 
posters, they may not strictly follow the rules while completing 
their assignments, since they are usually rewarded based on 
the number of posts. That is why we can find some special 
behavioral patterns of potential paid posters through statistical 
analysis. 

2) Management of Paid Posters: Occasionally, PR companies 
may hire many people and have a well-organized structure 
for some special events. Due to the large number of user IDs 
and different post missions, such an online activity needs to be 
well orchestrated to fulfill the goal. Our first-hand experience 
confirms an organizational structure of online paid posters as 
similar to that disclosed in ifTTl . When a mission is released, 
an organizational structure as shown in Fig. 1 is formed. The 
meaning or role of each component is as follows. 

- Mission represents a potential online event to be accomplished 
by online paid posters. Usually, one project 
manager and four teams, namely the trainer team, the poster 
team, the public relationship team, and the resource team, 
are assigned to a mission. All of them are employed by 
PR companies. 

- Project manager coordinates the activities of the four 
teams throughout the whole process. 

- Trainer team plans the schedule for paid posters, such as 
when and where to post and the distribution of shared 
user IDs. Sometimes, they also accept feedback from 
paid posters. 

- Posters team includes those who are paid to post infor- 
mation. They are often college students and unemployed 
people. For each validated post, they get 30 cents or 50 
cents. The posters can be grouped according to different 
target websites or online communities. They often have 
their own online communities for sharing experience and 
discussing missions. 

- Public relationship team is responsible for contacting 
and maintaining good relationship with other webmas- 
ters to prevent the posted messages from being deleted. 
Possibly, with some bonus incentives, these webmasters 
may even highlight the posts to attract more attention. 
In this sense, those webmasters are actually working for 
the PR companies. 

- Resources team is responsible for collecting/creating a 
large amount of online user IDs and other registration 
information used by the paid posters. Besides, they 
employ good writers to prepare specific post templates 
for posters. 



[Figure 1: organization chart with the Mission at the top, the Manager below it, and four teams beneath the manager: Trainer, Posters, Resources, and Public Relationship.]

Fig. 1. Management structure of online paid posters 

III. Data Collection 

In this paper, we use the second example introduced in 
Section II, the conflict between 360 and Tencent, as the case 
study. We collected news reports and relevant comments 
regarding this special social event. While the number of websites 
hosting relevant content is large, most posts could be found 
at two famous Chinese news websites: Sina.com [22] and 
Sohu.com [23], from which we collected enough data for our 
study. We call the data collected from Sina.com the Sina dataset 
and will use it as the training data for our detection model. 
The data collected from Sohu.com is called the Sohu dataset and 
it will be used as the test data for our detection method. 

We searched all news reports and comments from Sina.com 
and Sohu.com over the time period from September 10, 2010 
to November 21, 2010. As a result, we found 22 news reports 
in Sina.com and 24 news reports in Sohu.com. For each 
news report, there were many comments. For each comment, 
we recorded the following relevant information: Report ID, 
Sequence No., Post Time, Post Location, User ID, Content, 
and Response Indicator, the meanings of which are explained 
in Table I. 



Field               Meaning
Report ID           The ID of the news report that the comment belongs to
Sequence No.        The order of the comment w.r.t. the corresponding news report
Post Time           The time when the comment was posted
Post Location       The location from which the comment was posted
User ID             The user ID used by the poster
Content             The content of the comment
Response Indicator  Whether the comment is a new comment or a reply to another comment

TABLE I: Recorded information for each comment 
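
For readers who want to work with records of this shape, the fields of Table I map naturally onto a small record type. The following Python sketch is only an illustration: the class name `Comment`, the attribute names, and the chosen types are assumptions made here and are not taken from the original tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Comment:
    """One crawled comment, mirroring the fields of Table I (names are illustrative)."""
    report_id: str        # Report ID: news report the comment belongs to
    sequence_no: int      # Sequence No.: order of the comment within the report
    post_time: datetime   # Post Time
    post_location: str    # Post Location as displayed by the website
    user_id: str          # User ID used by the poster
    content: str          # Content of the comment
    is_reply: bool        # Response Indicator: True for a reply, False for a new comment


# Hypothetical record, for illustration only.
example = Comment("r01", 1, datetime(2010, 11, 3, 9, 30), "Beijing", "user_a", "text", False)
```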



We faced several hurdles during the data collection 
phase. At the outset, we had to tackle the difficulty of 
collecting data from dynamic web pages. Due to the use 
of AJAX [19] on most websites, comments are often 
displayed on web pages generated on the fly, and thus it was 
hard to retrieve the data from the source code of the web page. 
To be specific, after the client browser successfully 
downloads an HTML page, it needs to send further requests to 
the server to get the comments, which are then shown in the 
comment section. Most web crawlers that retrieve the 
source code do not support such functionality for obtaining 
dynamically generated data. To avoid this problem, we adopted 
Gooseeker [11], a powerful and easy-to-use software tool suitable 
for the above task. It allows us to indicate which part of the 
page should be stored on disk and then it automatically 
goes through all the comment information page by page. In 
our case study, due to the popularity and the broad impact 
of this social event, some news reports ended up with more 
than 100 pages of comments, with each page having 15 to 20 
comments. We stored all the comments of one web page in 
an XML file. We then wrote a program in Python to parse all 
files and strip the HTML tags. We finally stored all the 
required information in the format described in Table I in 
two separate files depending on whether the comments were 
from Sina.com or from Sohu.com. 
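
The tag-stripping step can be sketched with Python's standard library alone. The parsing program itself is not shown in this paper, so the code below is an assumed, minimal illustration; the helper name `strip_tags` is hypothetical, and the layout of the stored XML files is not modeled.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of a markup fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    """Return the visible text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    # Text from adjacent elements is joined with a space, then whitespace is normalized.
    return " ".join(" ".join(parser.parts).split())
```

Joining text nodes with a space keeps words from neighboring elements apart, at the cost of inserting a space where a tag splits a single word.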

We then needed to clean up duplicate data caused by some bugs on 
the server side of Sina and Sohu. We noticed that the server 
occasionally sent duplicate pages of comments, resulting in 
duplicate data in our final dataset. For example, for a certain 
report, we recorded more than 10,000 comments, with nearly 
5,000 duplicate comments. After removing these duplicate 
data, we got 53,723 records in Sina and 115,491 records in 
Sohu. There was a special type of comment sent by mobile 
users with cellular phones. The user IDs of mobile users, 
no matter where they come from, are all labeled as "Mobile 
User" on the web. There is no way to tell how many users 
are actually behind this single user ID. For this reason, we 
had to remove all comments from "Mobile User". We also 
needed to remove users who posted very few comments, 
since it is hard to tell whether they are normal users or paid 
posters, even with a manual check. To this end, we removed 
those users who posted fewer than 4 comments. Finally, 
Sohu allows anonymous posts (i.e., a user can post comments 
without needing to register for a user ID). Since the real number 
of users behind the anonymous posts is unknown, we excluded 
these anonymous posts from our dataset. 
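
The cleaning steps described above can be summarized in a short sketch. It assumes that each record is a dictionary with `report_id`, `sequence_no`, and `user_id` keys, that a duplicate is a repeated (report, sequence number) pair, and that anonymous posts carry an empty user ID. None of these representation details are stated in the paper; they are chosen here purely for illustration.

```python
from collections import Counter

MOBILE_USER = "Mobile User"  # shared label shown for all mobile posters


def clean_comments(records, min_comments=4, anonymous_id=""):
    """Drop duplicates, shared or anonymous IDs, and users with too few comments."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["report_id"], rec["sequence_no"])  # assumed duplicate key
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Comments whose true author cannot be identified are removed.
    named = [r for r in unique if r["user_id"] not in (MOBILE_USER, anonymous_id)]

    # Keep only users with at least `min_comments` comments (fewer than 4 are removed).
    counts = Counter(r["user_id"] for r in named)
    return [r for r in named if counts[r["user_id"]] >= min_comments]
```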

After the above steps, our Sina dataset included 552 users 
and 20,738 comments, and our Sohu dataset included 223 
users and 1,220 comments. It is very interesting to see that the 
two datasets seem to have largely different statistical features, 
e.g., the average number of comments per user in the Sina 
dataset is about 37.6 while that in the Sohu dataset is only 
5.5. One main reason is that Sohu allows anonymous posts, 
while Sina does not. 

A key question that we aim to answer is whether we can build 
an effective detection system that is trained with one dataset 
and later works well for other datasets. We will disclose our 
findings in the following sections. 



IV. Non-Semantic Analysis 

The goal of our non-semantic analysis is to find out the 
objective features that are useful in capturing potential paid 
posters' behavior. We use the Sina dataset as our training data 
and thus we only perform statistical analysis on this dataset. 

First of all, we need to find the ground truth from the data: 
who are the paid posters? Based on our working experience as 
a paid poster, we manually selected 70 "potential paid posters" 
from the 552 users, after reading the contents of their posts 
(many comments are meaningless or contradicting). We use 
the word potential to avoid the non-technical argument about 
whether a manually selected paid poster is really a paid poster. 
Any absolute claim is not possible unless a paid poster admits 
to it or his employer discloses it, both of which are unlikely 
to happen. We stress that most detection mechanisms, such as 
email spam detection or forum spam detection [20], face the 
same problem, and the argument about whether an email should 
really be considered spam is usually beyond the technical 
scope. 

After manually selecting the potential paid posters, we next 
perform statistical analysis to investigate objective features 
that are useful in capturing the potential paid posters' special 
behavior. We mainly test the following four features: the percentage 
of replies, the average interval time of posts, the number of 
days the user remains active, and the number of news reports 
that the user comments on. In the following, we use N_n and 
N_p to denote the number of normal users and the number of 
potential paid posters who meet the test criterion, respectively. 
Additionally, we use P_n and P_p to denote the percentage of 
normal users and the percentage of potential paid posters who 
meet the test criterion, respectively. 
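
To make the notation concrete, the four quantities can be computed from a per-user feature value and the set of manually labeled potential paid posters. This is a minimal sketch under assumed inputs; the function name `tabulate` and the dictionary-based representation are illustrative and do not come from the paper.

```python
def tabulate(feature, paid_ids, criterion):
    """Return (N_n, P_n, N_p, P_p) for one test criterion.

    feature:   dict mapping user ID -> feature value
    paid_ids:  set of user IDs labeled as potential paid posters
    criterion: function taking a feature value and returning True or False
    """
    normal = [v for u, v in feature.items() if u not in paid_ids]
    paid = [v for u, v in feature.items() if u in paid_ids]

    n_n = sum(1 for v in normal if criterion(v))  # normal users meeting the criterion
    n_p = sum(1 for v in paid if criterion(v))    # potential paid posters meeting it

    # Percentages are taken within each group, not over all users.
    p_n = 100.0 * n_n / len(normal) if normal else 0.0
    p_p = 100.0 * n_p / len(paid) if paid else 0.0
    return n_n, p_n, n_p, p_p
```

For example, `tabulate(ratios, paid_ids, lambda p: p <= 0.5)` would produce one row of a table such as Table II, given a hypothetical `ratios` dictionary of reply ratios.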

A. Percentage of Replies 

For this feature, we test whether a user tends to post new 
comments or reply to others' comments. We conjecture that 
potential paid posters may not have enough patience to read 
others' comments and reply. Therefore, they may create more 
new comments. Table II shows the statistical results and Fig. 2 
shows the respective graphs, where p represents the ratio of 
the number of replies to the total number of comments from 
the same user. 



Criterion    N_n    P_n       N_p    P_p
p <= 0.5     121    26.77%    59     84.29%
p > 0.5      331    73.23%    11     15.71%

TABLE II: The percentage of replies 



Based on the results, 59 (84.3%) of the potential paid posters have 
at most 50% of their posts being replies. In contrast, most normal 
users (73.2%) posted more replies than new comments. This 
observation supports our conjecture that potential paid posters 
are more likely to post new comments instead of reading and 
replying to others' comments. 

(a) The percentage of replies from normal users
(b) The percentage of replies from potential paid posters

Fig. 2. The percentage of replies from normal users and potential paid posters 
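
The ratio p used in this subsection is straightforward to compute per user. The sketch below assumes that comments are available as (user ID, is-reply) pairs, which is an illustrative representation of the Response Indicator field rather than the format actually used in our files.

```python
from collections import defaultdict


def reply_ratio(comments):
    """Map each user ID to p = (number of replies) / (total number of comments).

    comments: iterable of (user_id, is_reply) pairs
    """
    totals = defaultdict(int)
    replies = defaultdict(int)
    for user_id, is_reply in comments:
        totals[user_id] += 1
        if is_reply:
            replies[user_id] += 1
    # Every user in `totals` has at least one comment, so the division is safe.
    return {user: replies[user] / totals[user] for user in totals}
```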



B. Average Interval Time of Posts 

We calculate the average interval time between two consec- 
utive comments from the same user. Note that it is possible 
for a user to take a long break (e.g., several days) before 
posting messages again. To alleviate the impact of long break 
times, for each user, we divide his/her active online time into 
epochs. Within each epoch, the interval time between any two 
consecutive comments cannot be larger than 24 hours. We 
calculate the average interval time of posts within each epoch, 
and then take the average again over all the epochs. 
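
The epoch-based averaging can be sketched as follows. The sketch makes two assumptions that the description above leaves open: an epoch containing a single comment contributes no interval and is skipped, and a user with no measurable interval yields `None`. The function name is illustrative.

```python
from datetime import datetime, timedelta


def average_interval_seconds(post_times, max_gap=timedelta(hours=24)):
    """Average of per-epoch mean intervals (in seconds) for one user.

    A gap larger than `max_gap` between consecutive comments starts a new epoch.
    """
    times = sorted(post_times)
    epoch_means = []
    gaps = []  # intervals inside the current epoch
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev
        if gap > max_gap:
            # Close the current epoch; single-comment epochs have no gaps (assumption).
            if gaps:
                epoch_means.append(sum(gaps) / len(gaps))
            gaps = []
        else:
            gaps.append(gap.total_seconds())
    if gaps:
        epoch_means.append(sum(gaps) / len(gaps))
    if not epoch_means:
        return None
    return sum(epoch_means) / len(epoch_means)


# Hypothetical user: three comments on one day, two more comments two days later.
times = [
    datetime(2010, 10, 1, 10, 0), datetime(2010, 10, 1, 10, 1), datetime(2010, 10, 1, 10, 3),
    datetime(2010, 10, 3, 9, 0), datetime(2010, 10, 3, 9, 5),
]
print(average_interval_seconds(times))  # epochs average 90 s and 300 s, giving 195.0
```

Averaging per epoch first means that one long session with many short gaps does not dominate a user's value.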

Intuitively, normal users are considered to be less aggressive 
when posting comments while paid posters care more about 
finishing their jobs as soon as possible. This implies that the 
average interval time of posts from paid posters should be 
smaller. Table III shows the statistical results and Fig. 3 shows 
the corresponding graphs. 



Interval time (Second)    N_n    P_n       N_p    P_p
0-150                     103    22.91%    35     50.00%
150-300                   153    33.94%    20     28.57%
300-450                   93     20.67%    10     14.28%
450-600                   41     9.16%     3      4.29%
600-750                   35     7.83%     0      0.00%
750-900                   11     2.52%     1      1.43%
> 900                     13     2.97%     1      1.43%

TABLE III: The average interval time of posts 

Based on the above results, 50% of potential paid posters post 
comments with an interval time of less than 2.5 minutes, while 23% 
of normal users post at such a speed. Nearly 80% of potential 
paid posters post comments with an interval time of less than 5 
minutes, while only 57% of normal users post at this speed. From 
Fig. 3, we can easily see that the potential paid posters are 
more likely to post in a very short time period. This matches 
our intuition that paid posters care mainly about finishing their 
jobs as soon as possible and do not have enough interest to 
get involved in the online discussion. 

We observed that some potential paid posters also post 
messages at a relatively slow speed (the interval time is larger 
than 750 seconds). There is one main explanation for the 
existence of these "outliers". As mentioned earlier, the trainer 
team may enforce rules that the paid posters need to follow. 




(a) The average interval time of posts from normal users
(b) The average interval time of posts from potential paid posters

Fig. 3. The average interval time of posts from normal users and potential 
paid posters 



For example, identical replies should not appear more than 
twice in the same news report or within a short time period. Such 
rules are made to keep the paid posters from being detected 
easily. If a paid poster follows these tactics, he/she may have a 
statistical feature similar to that of a normal user. Nevertheless, 
it seems that the majority of potential paid posters did not 
follow the rules strictly. 

C. Active Days 

We analyzed the number of days that a user remains active 
online. We divided the users into 7 groups based on whether 
they stayed online for 1, 2, 3, 4, 5, 6, or more than 6 days, 
respectively. According to our experience as a paid poster, 
potential paid posters usually do not stay online using the 
same user ID for a long time. Once a mission is finished, a 
paid poster normally discards the user ID and never uses it 
again. When a new mission starts, a paid poster usually uses 
a different user ID, which may be newly created or assigned 
by the resource team. Table IV shows the statistical results and 
Fig. 4 shows the corresponding graphs. 



No. of active days    N_n    P_n       N_p    P_p
1                     255    56.33%    43     61.43%
2                     99     21.91%    18     25.71%
3                     54     11.95%    8      11.43%
4                     28     6.19%     1      1.43%
5                     11     2.52%     0      0.00%
6                     3      0.66%     0      0.00%
> 6                   2      0.44%     0      0.00%

TABLE IV: The number of active days 



(a) The number of active days of normal users
(b) The number of active days of potential paid posters

Fig. 4. The number of active days of normal users and potential paid posters 

According to the statistical results, the percentage of potential 
paid posters and the percentage of normal users are broadly 
similar in the groups that remain active for 1, 2, 3, and 4 
days. Nevertheless, about 4% of normal users keep taking 
part in the discussion for 5 or more days, while we found 
that no potential paid poster stayed for more than 4 days. This 
evidence suggests that potential paid posters are not willing 
to stay for a long time. They instead tend to accomplish their 
assignments quickly and, once they are done, they do not visit 
the same website again. 
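
One possible way to compute this feature is to count the distinct calendar dates on which a user ID posted at least one comment. Treating "active days" as distinct dates is an assumption made for this sketch, and the function names are illustrative.

```python
from collections import defaultdict
from datetime import datetime


def active_days(comments):
    """Map each user ID to the number of distinct dates with at least one comment.

    comments: iterable of (user_id, post_time) pairs, post_time being a datetime
    """
    days = defaultdict(set)
    for user_id, post_time in comments:
        days[user_id].add(post_time.date())
    return {user: len(dates) for user, dates in days.items()}


def group_label(n_days):
    """Bucket a day count into the seven groups used in Table IV."""
    return str(n_days) if n_days <= 6 else "> 6"
```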

D. The Number of News Reports 

We studied the number of news reports for which a user has 
posted comments. We divided the users into 7 groups based 
on whether they have commented on 1, 2, 3, 4, 5, 6, or more than 6 
news reports, respectively. Table V shows the statistical results 
and Fig. 5 shows the corresponding graphs. 



No. of News Reports    N_n    P_n       N_p    P_p
1                      200    44.25%    31     44.29%
2                      114    25.22%    20     28.57%
3                      72     15.93%    9      12.85%
4                      39     8.63%     5      7.14%
5                      15     3.32%     3      4.29%
6                      8      1.77%     1      1.43%
> 6                    4      0.88%     1      1.43%

TABLE V: The number of news reports 



According to the results, the potential paid posters and 
normal users have similar distributions with respect to the 
number of commented news reports. We originally conjectured 
that paid posters might comment on a larger number of news 
reports. While normal users might not 
be interested in reports that are not well written or not 
interesting, paid posters care less about the contents of the 
news. Nevertheless, we did not find strong evidence to support 
this conjecture in the Sina dataset. This indicates that the 
number of commented news reports alone may not be a good 
feature for the detection of potential paid posters. 

(a) The number of news reports that a normal user has commented on
(b) The number of news reports that a potential paid poster has commented on

Fig. 5. The number of news reports that a user has commented on 

E. Other Observations 

We also discuss other possible features of potential paid 
posters. These observations come from our working experience 
as a paid poster. Although we cannot find sufficient evidence 
in the Sina dataset, we discuss these features as they can be 
beneficial for future research on this topic. 

First, there may be some pattern in the geographic distribution 
of online paid posters. We performed a statistical study on the 
Sina dataset, but found that both normal users and potential 
paid posters are mainly located in the central and southern 
regions of China. While the two companies involved in the event, 
Tencent and 360, are located in the province of Guangdong 
and in Beijing, respectively, we found no relationship between 
the locations of potential paid posters and the locations of the 
two companies. Fig. 6 shows the geographic distribution of 
normal users and potential paid posters, with the darker color 
representing more users. The figure does not exhibit a clear 
pattern to distinguish potential paid posters from normal users. 

Second, the same user ID may appear at different geographical 
locations within a very short time period. This would be a clear 
indication of a paid poster. Normal users are not able to move to 
a different city in a few minutes or hours, but paid posters can 
appear to do so because their user IDs may be assigned dynamically by the 
resource team. We identified this possible feature for analysis 
but could not find sufficient evidence in the Sina dataset. 
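
A check for this second observation could look like the sketch below. The one-hour window, the plain string comparison of locations, and the function name are all assumptions made for illustration. Such a check was not part of the statistical analysis reported in this section.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def location_jumps(comments, window=timedelta(hours=1)):
    """Return the user IDs that post from two different locations within `window`.

    comments: iterable of (user_id, post_time, location) triples
    """
    by_user = defaultdict(list)
    for user_id, post_time, location in comments:
        by_user[user_id].append((post_time, location))

    flagged = set()
    for user_id, posts in by_user.items():
        posts.sort(key=lambda p: p[0])  # order by time only
        # Only consecutive comments are compared, which keeps the check linear.
        for (t1, loc1), (t2, loc2) in zip(posts, posts[1:]):
            if loc1 != loc2 and t2 - t1 <= window:
                flagged.add(user_id)
                break
    return flagged
```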

(a) The geographical distribution of normal users
(b) The geographical distribution of potential paid posters

Fig. 6. The geographic distribution of users, with darker color representing 
more users 



Third, there might exist contradicting comments from paid 
posters. The reason is that they are paid to post without any 
personal emotion. It is their job. Sometimes, they just post 
comments without carefully checking their content. Neverthe- 
less, this feature requires the detection system to have enough 
intelligence to understand the meaning of the comments. 
Incorporating this feature into the system is challenging. 

Fourth, paid posters may post replies that have nothing to 
do with the original message. To earn more money, some paid 
posters just copy and paste existing posts and simply click the 
reply button to increase the total number of posts. They do not 
really read the news reports or others' comments. Again, this 
feature is hard to implement since it requires high intelligence 
for the detection system. 

V. Semantic Analysis 

An important criterion in our manual identification of a 
potential paid poster is to read his/her comments and make a 
choice based on common sense. For example, if a user posted 
meaningless messages or messages contradicting each other, 
the user is very likely to be a paid poster. Nevertheless, it is 
very hard to integrate such human intelligence into a detection 
system. In this section, we propose a simple semantic analysis 
method that is demonstrated to be very effective in detecting 
potential paid posters. 

While it is hard to design a detection system that under- 
stands the meaning of a comment, we observed that potential 
paid posters tend to post similar comments on the web. In 
many cases, a potential paid poster may copy and paste 
existing comments with slight changes. This provides the 
intuition for our semantic analysis technique. 

Our basic idea is to search for similarity between comments. 
To do this, we first need to overcome the special difficulty 
in splitting a Chinese sentence into words. Unlike English
sentences, which have spaces between words, many Asian
languages such as Chinese and Japanese depend on context to
determine word boundaries. They do not put spaces between words,
and how to split a sentence is left to the reader. We used a well-known
Chinese splitting software, called ICTCLAS2011 021 to cut a 
sentence into words. For a given sentence, the software outputs 
its content words and stop words 0. Simply put, content 
words are words that have an independent meaning, such as 
noun, verb, or adjective. They have a stable lexical meaning 
and should express the main idea of a sentence. Stop words are 
words that do not have a specific meaning but have syntactic 
function in the sentence to make it grammatically correct. Stop 
words thus should be filtered out from further processing. 

The above step translates a sentence into a list of content 
words. For a given pair of comments, we compare the two lists 
of content words. As mentioned before, a paid poster may 
make slight changes before posting two similar comments. 
Therefore, we may not be able to find an exact match between 
the two lists. We first find their common content words, and 
if the ratio of the number of common content words over the 
length of the shorter content word list is above a threshold 
value (e.g., 80% in our later test), we conclude that the two 
comments are similar. If a user has multiple pairs of similar 
comments, the user is considered a potential paid poster. Note 
that similarity of comments is not transitive in our method. 
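
The comparison described above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes each comment has already been reduced to its content words by a segmenter, treats each word list as a set (duplicate words are collapsed), and treats "above the threshold" as inclusive. The function name `are_similar` and the English stand-in words are hypothetical.

```python
def are_similar(words_a, words_b, threshold=0.8):
    """Return True if two content-word lists count as similar.

    The ratio is (number of common content words) divided by the size of
    the shorter list. Collapsing duplicates via sets and using >= for the
    threshold are assumptions of this sketch.
    """
    set_a, set_b = set(words_a), set(words_b)
    shorter = min(len(set_a), len(set_b))
    if shorter == 0:
        return False  # an empty comment is not similar to anything
    common = len(set_a & set_b)
    return common / shorter >= threshold


# Stand-in content words; real input would come from the Chinese segmenter.
a = ["product", "fast", "safe", "free", "good"]
b = ["product", "fast", "safe", "free", "bad"]
c = ["product", "fast", "safe", "slow", "bad"]

print(are_similar(a, b))  # 4 of 5 words in common
print(are_similar(b, c))  # 4 of 5 words in common
print(are_similar(a, c))  # 3 of 5 words in common
```

The three toy comments also show why similarity is not transitive: `a` is similar to `b` and `b` is similar to `c`, yet `a` and `c` share only 3 of 5 words and fall below the 80% threshold.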

We found that a normal user might occasionally have two 
identical comments. This may be caused by slow Internet
access, due to which the user presses the submit button twice 
before his/her post is displayed. Our manual check of these 
users confirmed that they are normal users, based on the 
content they posted. To reduce the impact of the "unusual 
behavior of normal users", we set the threshold of similar 
pairs of comments to 3. This threshold value is demonstrated 
to be effective in addressing the above problem. 
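
A minimal sketch of the per-user decision follows. It counts similar pairs over all pairs of a user's comments and applies the threshold of 3. Reading the threshold as "at least 3 similar pairs" is an assumption, as are the set-based ratio, the inclusive 80% cut-off, and the function names `count_similar_pairs` and `is_potential_paid_poster`.

```python
from itertools import combinations


def count_similar_pairs(comments, threshold=0.8):
    """Count similar pairs among one user's comments.

    comments: list of content-word lists, one list per comment.
    """
    count = 0
    for words_a, words_b in combinations(comments, 2):
        set_a, set_b = set(words_a), set(words_b)
        shorter = min(len(set_a), len(set_b))
        if shorter and len(set_a & set_b) / shorter >= threshold:
            count += 1
    return count


def is_potential_paid_poster(comments, min_pairs=3):
    """Apply the similar-pair threshold (assumed to mean 'at least 3')."""
    return count_similar_pairs(comments) >= min_pairs


# A normal user who accidentally submitted the same comment twice: 1 pair.
double_submit = [["network", "slow"], ["network", "slow"], ["like", "game"]]
# A user who posted the same comment three times: 3 pairs.
repeated = [["buy", "product", "now"]] * 3

print(count_similar_pairs(double_submit), is_potential_paid_poster(double_submit))
print(count_similar_pairs(repeated), is_potential_paid_poster(repeated))
```

With this reading, a single accidental double submission yields one similar pair and stays below the threshold, which is the case the threshold is meant to tolerate.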

While there are many other, more complex semantic analysis
methods for measuring the similarity between two texts
[15], [12], [11], [16], we believe that comments are much
shorter than articles and therefore a simple method such as the
one above is good enough. This is demonstrated later in
Section VI.

We performed the semantic analysis on the training data, the
Sina dataset. The results are shown in Table VI, and Fig. 7 shows
the corresponding graphs. We list the statistics on the number
of similar comment pairs for each user. From this table, we can
see that normal users tend to post different comments: 79.6% of
them do not have any similar comment pairs. In sharp contrast,
78.6% of the potential paid posters have more than 5 similar
comment pairs.



Similar Pairs of Comments    N_n     P_n      N_p     P_p
---------------------------------------------------------
0                            360     79.65%     4      5.71%
1                             38      8.41%     3      4.29%
2                              8      1.77%     2      2.86%
3                             19      4.20%     4      5.71%
4                              7      1.55%     2      2.86%
5                              1      0.22%     0      0.00%
>= 6                          19      4.20%    55     78.57%

TABLE VI: Semantic analysis of Sina dataset



VI. Classification 

The objective of our classification system is to classify each
user as a potential paid poster or a normal user using the
features investigated in Section IV and Section V. According
to the statistical and semantic analysis results, we found that
no single feature is sufficient to locate potential paid
posters. Therefore, we use and compare the performance of
different combinations of the five features discussed in the 
previous two sections in our classification system. We model 
the detection of potential paid posters as a binary classification 
problem and solve the problem using a support vector machine 
(SVM) 0. 

We used a Python interface of LIBSVM [5] as the tool for
training and testing. By default, LIBSVM adopts the radial basis
function kernel [7]; we used 10-fold cross-validation to train on
the data and obtain a classifier. After training the classifier with
the Sina dataset, we used the classifier to test the Sohu dataset.
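
To make the 10-fold cross-validation step concrete, the sketch below shows one way to partition the training users into 10 folds, so that each fold serves once as the validation set while the other nine are used for training. It does not reproduce the SVM training itself; the function name `k_fold_indices`, the shuffling seed, and the round-robin fold assignment are assumptions of this example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment gives fold sizes that differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, validation


# 522 = 70 paid posters + 452 normal users in the Sina training set.
splits = list(k_fold_indices(522, k=10))
print(len(splits), [len(validation) for _, validation in splits])
```

Every user appears in exactly one validation fold, so averaging the validation results over the 10 folds uses each training sample once for validation.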

Before evaluating the performance of our classifier on the
test dataset, we first manually identified the potential paid
posters in the Sohu dataset by reading the contents of their
posts. The numbers of manually selected paid posters and
normal users are listed in Table VII.





                  Sina.com    Sohu.com
--------------------------------------
Paid Poster          70          82
Normal Users        452         141

TABLE VII: Number of paid posters and normal users in
Sina.com and Sohu.com



We evaluate the performance of the classifier using four
metrics: precision, recall, F-measure and accuracy, which are
defined in Table VIII. Note that these four metrics are
well-known and broadly used measures in the evaluation of a
classification system [24]. In the table, benchmark result means
the result obtained with manual identification of potential paid
posters.





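                                  Classified Result
                              Normal User       Paid Poster
-----------------------------------------------------------
Benchmark    Normal User      True Negative     False Positive
Result       Paid Poster      False Negative    True Positive

$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$

$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$

$$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{Accuracy} = \frac{\text{TrueNegative} + \text{TruePositive}}{\text{Total Number of Users}}$$

TABLE VIII: Metrics to evaluate the performance of a
classification system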
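
The four metrics can be computed directly from the confusion-matrix counts. The following minimal sketch does so; the function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions of this example. The sample call uses the 5-feature counts reported in Table IX.

```python
def classification_metrics(true_negative, false_positive,
                           false_negative, true_positive):
    """Compute precision, recall, F-measure and accuracy from raw counts."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    total = true_negative + false_positive + false_negative + true_positive

    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (true_negative + true_positive) / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# 5-feature column of Table IX: TN=138, FP=3, FN=22, TP=60.
metrics = classification_metrics(138, 3, 22, 60)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The zero-denominator guards matter in practice: a classifier that labels nobody as a paid poster has no predicted positives, and its precision would otherwise be undefined.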



A. Classification without Semantic Analysis 

We first focus on classification using only the statistical
analysis results. Based on the statistical analysis in Section IV,
we notice that the first two features, ratio of replies and
average interval time of posts, show great differences between
the potential paid posters and the normal users. Therefore, we
train the SVM model using the Sina dataset with those two
features and test the model on the Sohu dataset to see the
performance. As a comparison, we also train the model using
all four non-semantic features. The results are listed in
Table IX.



Metrics           2-Feature    4-Feature    5-Feature
-----------------------------------------------------
True Negative        141          108          138
False Positive         0           33            3
False Negative        80           50           22
True Positive          2           32           60
Precision         100.00%      49.23%       95.24%
Recall              2.43%      39.02%       73.17%
F-measure           4.76%      43.54%       82.76%
Accuracy           64.12%      62.78%       88.79%

TABLE IX: Test results with non-semantic and semantic
features




Fig. 7. The number of similar pairs of comments posted by a user.
(a) Normal users. (b) Potential paid posters.



For the 2-feature test, although the precision is 100%, only
2 out of 82 potential paid posters are correctly identified by
the classifier. Such a low recall is unacceptable if we want to
use this classifier to find paid posters. These results suggest
that the first two features lead to significant bias in our
classification, and we need to add more features to our classifier.

When we use the four non-semantic features as vectors to
train the SVM model and run the same test on the Sohu dataset,
the recall and F-measure improve considerably, but the precision
and accuracy do not. The numbers of false positives and false
negatives are still too high to claim acceptable
performance. The low precision indicates that the SVM
classifier using the four non-semantic features as its vector set
is unreliable and needs to be improved further. We achieve
this by adding the semantic analysis to our classifier.

B. Classification with Semantic Analysis

As described earlier, we have observed that online paid 
posters tend to post similar comments on the web, and based 
on this observation we have designed a simple method for 
semantic analysis. After integrating this semantic analysis 
method into our SVM model, we performed the test again over 
the Sohu dataset. The much improved performance results are
shown in the last column of Table IX.

The results clearly demonstrate the benefit of using semantic 
analysis in the detection of online paid posters. The precision, 
recall, F-measure and accuracy have improved to 95.24%, 
73.17%, 82.76% and 88.79%, respectively. Based on these 
improved results, the semantic feature can be considered as 
a useful and important supplement to the other features. The 
reason why the semantic analysis improves performance is 
that online paid posters often try to post many comments with 
some minor edits on each post, leading to similar sentences. 
This helps the paid posters post many comments and complete 
their assignments quickly, but also helps our classifier to detect 
them. 

VII. Related Work 

In this paper, we focus on paid posters who post comments 
online to influence people's thoughts regarding popular social 
events. We characterize the basic organization structure of paid 
posters as well as their online posting patterns. To the best 
of our knowledge, this paper is the first to study the social 
phenomenon of paid posters. 

Some previous work on spam detection is similar
to ours. Researchers have done plenty of work in this area
to design better classification mechanisms. Niu et al. [18]
conducted a quantitative study of forum spamming, found
that forum spamming is a widespread problem, and developed
a context-based detection method to identify spammers.
Shin et al. [20] improved on this work by designing a
lightweight classifier that can be used on the forum server in
real time. They conducted a detailed analysis of their datasets
and identified typical features of forum spam to assist the
classifier. Bhattarai et al. [3] explored the characteristics of
comment spam in blogs based on their content. To detect
comment spam, they also investigated the notion of comment
similarity through word duplication and semantic similarity.

The spammers in those scenarios use software to post
malicious comments on forums and blogs to change
the results of search engines or to make their sites popular.
However, the definition of spam has been extended to a
much wider concept. Basically, any user whose behavior might
interfere with normal communication or aid the spread of
misleading information is regarded as a spammer. Examples
include forum spammers and comment spammers in social
media. Yin et al. [26] studied so-called online harassment,
in which a user intentionally annoys other users in a web
community. They investigated the characteristics of a specific
type of harassment using local features, sentiment features
and contextual features. However, the performance of their
detection model may still need to be improved since the
maximum precision only reached 50%. Benevenuto et al. Q 
proposed a detection mechanism to identify malicious users 
who post video response spam on YouTube. They studied the
performance of several major attributes that were used to
characterize the behavior of malicious users. Gao et al. [10]
conducted a broad analysis of spam campaigns that occurred
in the Facebook network. From the dataset, they noticed that the
majority of malicious accounts are compromised accounts,
instead of "fake" ones created for spamming. Such compromised
accounts can be obtained through trading over a hidden online
platform, according to [25].

VIII. Conclusions and Future Work 

Detection of paid posters behind social events is an
interesting research topic and deserves further investigation.
In this paper, we disclose the organizational structure of
paid posters. We also collect real-world datasets that include
abundant information about paid posters. We identify their
special features and develop effective techniques to detect
them. Our classifier based on SVM, with integrated semantic
analysis, achieves 95.24% precision and 88.79% accuracy on the
real-world case study. As future work, we plan to further improve
our detection system and extend our research to other relevant
areas, such as network marketing.